Removal of swirl artifacts from CELP based speech coders.

The perception of speech processed by a CELP based coder, such as a VSELP coder, when operating
in noisy background conditions is improved by removing swirl artifacts during silence periods.
This is done by removing the low frequency components of the input signal when no speech is
detected. A speech activity detector distinguishes between a periodic signal, like speech, and a
non-periodic signal, like noise, by using most of the VSELP coder internal parameters to determine
the speech or non-speech conditions. To prevent the VSELP coder from determining pitches for
non-periodic signals, a high pass filter is applied to the input signal to remove the pitch
information for which the VSELP coder searches.









EP 0 660 301 A1 

BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention generally relates to digital voice communications and, more particularly, to the
removal of swirl artifacts from code excited linear prediction (CELP) based coders, such as vector-sum excited
linear predictive (VSELP) coders, when operating in background noise consisting of low or medium levels of
non-periodic signals.

Description of the Prior Art

Cellular telecommunications systems in North America are evolving from their current analog frequency
modulated (FM) form towards digital systems. Digital systems must encode speech for transmission and then,
at the receiver, synthesize speech from the received encoded transmission. For the system to be
commercially acceptable, the synthesized speech must not only be intelligible, it should be as close to the
original speech as possible.

Codebook Excited Linear Prediction (CELP) is a technique for speech encoding. The basic technique
consists of searching a codebook of randomly distributed excitation vectors for that vector which produces an
output sequence (when filtered through pitch and linear predictive coding (LPC) short-term synthesis filters) that
is closest to the input sequence. To accomplish this task, all of the candidate excitation vectors in the codebook
must be filtered with both the pitch and LPC synthesis filters to produce a candidate output sequence that can
then be compared to the input sequence. This makes CELP a very computationally-intensive algorithm, with
typical codebooks consisting of 1024 entries, each 40 samples long. In addition, a perceptual error weighting
filter is usually employed, which adds to the computational load.
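
The exhaustive search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the coder's actual implementation: the toy codebook size, the fixed all-pole coefficients, and the function names `synthesize` and `search_codebook` are hypothetical, and the pitch filter, gain optimization, and perceptual weighting are omitted.

```python
import random

def synthesize(excitation, lpc):
    """All-pole filter: y[n] = x[n] + sum_i lpc[i] * y[n-1-i], zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(lpc):
            if n - 1 - i >= 0:
                acc += a * out[n - 1 - i]
        out.append(acc)
    return out

def search_codebook(target, codebook, lpc):
    """Return (index, error) of the entry whose filtered output is closest to target."""
    best_index, best_error = -1, float("inf")
    for index, vector in enumerate(codebook):
        candidate = synthesize(vector, lpc)          # every entry must be filtered
        error = sum((t - c) ** 2 for t, c in zip(target, candidate))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error

rng = random.Random(0)
SUBFRAME = 40            # samples per excitation vector, as in the text
ENTRIES = 64             # toy size (assumption); the text cites 1024 entries as typical
lpc = [0.9, -0.2]        # hypothetical stable short-term predictor coefficients
codebook = [[rng.gauss(0.0, 1.0) for _ in range(SUBFRAME)] for _ in range(ENTRIES)]

# Build the target from a known entry so the search has a definite answer.
target = synthesize(codebook[17], lpc)
best_index, best_error = search_codebook(target, codebook, lpc)
```

Because each of the candidate vectors is filtered before comparison, the cost grows with the product of codebook size and vector length, which is the load the text refers to.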

A number of techniques have been considered to mitigate the computational load of CELP encoders. Fast
digital signal processors have helped to implement very complex algorithms, such as CELP, in real-time.
Another strategy is a variation of the CELP algorithm called Vector-Sum Excited Linear Predictive Coding
(VSELP). An IS54 standard that uses a full rate 8.0 Kbps VSELP speech coder, convolutional coding for error
protection, differential quadrature phase shift keying (QPSK) modulation, and a time division, multiple access
(TDMA) scheme has been adopted by the Telecommunication Industry Association (TIA). See IS54 Revision
A, Document Number EIA/TIA PN2398.

The current VSELP codebook search method is disclosed in U.S. Patent No. 4,817,157 by Gerson. Gerson
addresses the problem of extremely high computational complexity for exhaustive codebook searching. The
Gerson technique is based on the recursive updating of the VSELP criterion function using a Gray code ordered
set of vector sum code vectors. The optimal code vector is obtained by exhaustively searching through the
Gray code ordered set of code vectors. The Electronic Industries Association (EIA) published in August 1991
the EIA/TIA Interim Standard PN2759 for the dual-mode mobile station, base station cellular telephone system
compatibility standard. This standard incorporates the Gerson VSELP codebook search method.
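
The benefit of a Gray code ordering can be sketched as follows. The sketch assumes that each code vector is a signed sum of the M basis vectors, with each codeword bit choosing the sign of one basis vector; that sign convention, the random basis, and the function names are assumptions made for illustration. For simplicity the sketch recursively updates the code vector itself rather than the criterion function. Successive Gray codewords differ in exactly one bit, so each step changes only one term of the sum.

```python
import random

def gray(k):
    """k-th codeword of the binary-reflected Gray code."""
    return k ^ (k >> 1)

def direct_vector(code, basis):
    """Signed sum of basis vectors: bit m set -> +basis[m], clear -> -basis[m]."""
    length = len(basis[0])
    return [
        sum((1 if (code >> m) & 1 else -1) * basis[m][n] for m in range(len(basis)))
        for n in range(length)
    ]

def gray_ordered_vectors(basis):
    """Yield (code, vector) for all 2**M codes, updating one basis term per step."""
    m_bits = len(basis)
    code = gray(0)
    vector = direct_vector(code, basis)               # the full sum is formed only once
    yield code, list(vector)
    for k in range(1, 2 ** m_bits):
        next_code = gray(k)
        flipped = (code ^ next_code).bit_length() - 1  # index of the single changed bit
        sign = 1 if (next_code >> flipped) & 1 else -1
        vector = [v + 2 * sign * b for v, b in zip(vector, basis[flipped])]
        code = next_code
        yield code, list(vector)

rng = random.Random(1)
M, N = 7, 40   # 7-bit codeword and 40-sample vectors, the values given later in this description
basis = [[float(rng.randint(-8, 8)) for _ in range(N)] for _ in range(M)]
visited = list(gray_ordered_vectors(basis))
```

Each step costs one vector addition instead of M, while still visiting every one of the 2^M code vectors.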

The CELP based coders, which use LPC coefficients to model input speech, work well for clean signals;
however, when background noise is present in the input signal, the coders do a poor job of modelling the signal.
This results in some artifacts at the receiver after decoding. These artifacts, referred to as swirl artifacts,
considerably degrade the perceived quality of the transmitted speech.

SUMMARY OF THE INVENTION



It is therefore an object of the present invention to provide an improvement in the perception of speech 
processed by a CELP based coder, such as a VSELP coder, when operating in noisy background conditions 
by removing the swirl artifacts during silence periods. 

According to the invention, the low frequency components of the input signal are removed when no speech
is detected, thus removing the swirl artifacts during silence periods. This results in a better perception of the
speech at the receiver. The invention uses a voice activity detector (VAD) which distinguishes between a
periodic signal, like speech, and a non-periodic signal, like noise. This VAD uses most of the VSELP coder internal
parameters to determine the speech or non-speech conditions. More particularly, the VSELP coder tends to
determine pitch information from a non-periodic input signal even though the actual input signal does not have
any periodicity. This determination of pitch from a non-speech signal is what generates the swirl artifact
in the reproduced signal at the receiver. To prevent the VSELP coder from determining pitches for non-periodic
signals, a high pass filter is applied to the input signal to remove the pitch information for which the VSELP
coder searches. With the pitch information removed, only the code search process generates the speech
frame information. Alternatively, the VSELP coder can be made to declare a no pitch condition and continue
processing without pitch information.

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other objects, aspects and advantages will be better understood from the following
detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
Figure 1 is a block diagram of a speech decoder utilizing two VSELP excitation codebooks;
Figure 2 is a block diagram of a speech synthesizer using two VSELP excitation codebooks and a long
term filter state of past excitation;
Figure 3 is a block diagram of the circuitry used to remove swirl artifacts from the VSELP coder; and
Figure 4 is a block diagram showing the architecture of the voice activity detection process.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION 

Referring now to the drawings, and more particularly to Figure 1, there is shown a block diagram of the
speech decoder 10 utilizing two VSELP excitation codebooks 12 and 14 as set out in the EIA/TIA Interim
Standard, cited above. Each of these codebooks is typically implemented in read only memory (ROM) containing
M basis vectors of length N, where M is the number of bits in the codeword and N is the number of samples
in the vector. Codebook 12 receives an input code I and provides an output vector. Codebook 14 receives an
input code H and provides an output vector. Each of these vectors is scaled by corresponding gain terms γ1
and γ2, respectively, in multipliers 16 and 18. In addition, long term filter state memory 20, typically in the form
of a random access memory (RAM), receives an input lag code, L, and provides an output, b_L(n), representing
the long term filter state. This too is scaled by a gain term β in multiplier 22. The outputs from the three
multipliers 16, 18 and 22 are combined by summer 24 to form an excitation signal, ex(n). This combined excitation
signal is fed back to update the long term filter state memory 20, as indicated by the dotted line. The excitation
signal is also applied to the linear predictive code (LPC) synthesis filter 26, which is represented by its
z-transform. The transfer function of the synthesis filter 26 is time variant, controlled by the short-term filter
coefficients α_i. After reconstructing the speech signal with the synthesis filter 26, an adaptive spectral postfilter
28 is applied to enhance the quality of the reconstructed speech. The adaptive spectral postfilter is the final
processing step in the speech decoder, and the digital output speech signal is input to a digital-to-analog (D/A)
converter (not shown) to generate the analog signal which is amplified and reproduced by a speaker.
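
The decoder signal flow of Figure 1 can be sketched as follows. This is an illustration under stated assumptions rather than the standard's procedure: the names `build_excitation` and `lpc_synthesis` are hypothetical, the long term filter state is assumed to be the past excitation read back with a delay of L samples, the synthesis filter is assumed to have the usual all-pole form y(n) = ex(n) + sum of α_i·y(n-i), and the adaptive spectral postfilter 28 is omitted.

```python
def build_excitation(history, lag, beta, vec_i, gamma1, vec_h, gamma2):
    """Form ex(n) = beta*b_L(n) + gamma1*u_I(n) + gamma2*u_H(n) for one subframe.

    b_L(n) is read from the past excitation `history`, delayed by `lag` samples
    (assumed convention); lags shorter than the subframe reuse the delayed
    samples periodically.
    """
    n_sub = len(vec_i)
    start = len(history) - lag
    b_l = [history[start + (n % lag)] for n in range(n_sub)]
    ex = [beta * b + gamma1 * ui + gamma2 * uh
          for b, ui, uh in zip(b_l, vec_i, vec_h)]
    return ex, history + ex          # feedback path: the state is extended with ex(n)

def lpc_synthesis(excitation, alphas):
    """All-pole synthesis: y[n] = ex[n] + sum_i alphas[i-1] * y[n-i], zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(alphas, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out

# Tiny worked example with a 4-sample "subframe" so the numbers can be checked by hand.
history = [1.0, 2.0, 3.0, 4.0]
ex, history = build_excitation(history, lag=2, beta=0.5,
                               vec_i=[1.0, 0.0, 0.0, 0.0], gamma1=2.0,
                               vec_h=[0.0, 1.0, 0.0, 0.0], gamma2=3.0)
speech = lpc_synthesis(ex, alphas=[0.5])
```

Here the lag of 2 reads back the samples 3.0 and 4.0, so ex(n) is [3.5, 5.0, 1.5, 2.0] and the one-coefficient synthesis filter yields [3.5, 6.75, 4.875, 4.4375].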

The following are the basic parameters for the 7950 bps speech coder and decoder as specified by the
EIA/TIA Interim Standard:



Symbol        Parameter                         Value
-             sampling rate                     8 kHz
-             frame length                      160 samples
N             subframe length                   40 samples
M1            # bits codeword I                 7
M2            # bits codeword H                 7
α_i           short-term filter coefficients    38 bits/frame
I, H          codewords                         7+7 bits/subframe
β, γ1, γ2     gains                             8 bits/subframe
L             lag                               7 bits/subframe
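
As a cross-check of the parameters above, the listed bit allocations can be summed and compared with the stated 7950 bps rate. The variable names are illustrative only; the figures are taken from the list above. The listed items account for 154 bits per frame, which leaves 5 bits per frame that this description does not itemize.

```python
FRAME_SAMPLES = 160
SUBFRAME_SAMPLES = 40
SAMPLE_RATE_HZ = 8000
STATED_BIT_RATE = 7950                                        # bps, as stated in the text

subframes_per_frame = FRAME_SAMPLES // SUBFRAME_SAMPLES       # 4 subframes per frame
frames_per_second = SAMPLE_RATE_HZ // FRAME_SAMPLES           # 50 frames per second (20 ms each)

bits_per_subframe = (7 + 7) + 8 + 7                           # codewords I and H, gains, lag
listed_bits_per_frame = 38 + subframes_per_frame * bits_per_subframe
listed_bit_rate = listed_bits_per_frame * frames_per_second

# Bits per frame needed to reach the stated rate but not itemized in the list above.
unlisted_bits_per_frame = (STATED_BIT_RATE - listed_bit_rate) // frames_per_second
```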



Figure 2 is a block diagram of the encoder 30 for generating the codewords I and H, the lag L, and the
gains β, γ1 and γ2, which are transmitted to the decoder shown in Figure 1. The encoder includes two VSELP
excitation codebooks 32 and 34, similar to the codebooks 12 and 14. Codebook 32 receives an input code I
and provides an output vector. Codebook 34 receives an input code H and provides an output vector. Each of
these vectors is scaled by corresponding gain terms γ1 and γ2, respectively, in multipliers 36 and 38. In addition,
long term filter state memory 40 receives an input lag code, L, and provides an output, b_L(n), representing the
long term filter state. This too is scaled by a gain term β in multiplier 42. The outputs from the three multipliers
36, 38 and 42 are combined by summer 44 to form an excitation signal, ex(n). This combined excitation signal
is applied to the weighted synthesis filter 46, represented by the z-transform H(z). This is an all pole filter and
is the bandwidth expanded synthesis filter. The output of the synthesis filter 46 is the vector p'(n). The
sampled speech signal s(n) is input to a weighting filter 48, having a transfer function represented by the
z-transform W(z), to generate the weighted speech vector p(n). p(n) is the weighted input speech for the subframe
minus the zero input response of the weighted synthesis filter 46. The vector p'(n) is subtracted from the
weighted speech vector p(n) in subtractor 50 to generate a difference signal e(n). The signal e(n) is subjected to a
sum of squares analysis in block 52 to generate an output that is the total weighted error which is input to error
minimization process 54. The error minimization process selects the lag L and the codewords I and H
sequentially (one at a time), to minimize the total weighted error.

The improvement to the basic VSELP coder is shown in Figure 3, to which reference is now made. The 
input signal is digitized by an analog-to-digital (A/D) converter 54 and supplied to one pole of a switch 56. The 
digitized input signal is also supplied via a high pass filter 58 to a second pole of the switch 56. The switch 56 
is controlled to select either the digitized input signal or the high pass filtered output from filter 58 by a voice 
activity detector (VAD) 60. The output of the switch 56 is supplied to the VSELP coder 62. The VAD 60 receives 
as inputs the original digitized input signal and an output of the VSELP coder 62. It will be understood that 
once the analog input signal is sampled by the A/D converter 54, typically at an 8kHz sampling rate, all proc- 
essing represented by the remaining blocks of the block diagram of Figure 3 is performed by a digital signal 
processor (DSP), such as the TMS320C5x single chip DSP. 

As described above, the VSELP coder 62 determines pitch and input signal transfer function (i.e., reflection
coefficients). The VAD 60 uses the reflection coefficients generated by the VSELP coder 62 and the input
signal in order to generate a decision of speech (i.e., a TRUE output) or no speech (i.e., a FALSE output). The
TRUE output causes the switch 56 to select the digitized input signal from the A/D converter 54, but a FALSE
output causes the switch 56 to select the high pass filtered output from high pass filter 58. More particularly,
the VAD 60 uses the reflection coefficients from the VSELP coder 62 in determining current frame LPC
coefficients, and these LPC coefficients and previously determined LPC coefficient histories are averaged and
stored in a buffer. The original 160 input samples are 500 Hz highpass filtered and used in determining the
auto-correlation function (ACF), and this ACF and previously determined ACFs are stored in a buffer. This data
is used by the VAD 60 to determine whether speech is present or not. The architecture of this detection process
is shown in Figure 4, to which reference is now made.
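
The ACF buffering described above can be sketched as follows. This is a minimal illustration: the class name `AcfHistory`, the number of ACF lags (`max_lag=8`), and the plain arithmetic averaging are assumptions, and the 500 Hz highpass filtering that the text applies before the ACF is omitted here.

```python
from collections import deque

def autocorrelation(frame, max_lag):
    """ACF values r[0..max_lag] with r[k] = sum_n frame[n] * frame[n-k]."""
    return [
        sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
        for k in range(max_lag + 1)
    ]

class AcfHistory:
    """Keeps the ACF of the current frame and of the frames just before it."""

    def __init__(self, frames_kept=4):
        # deque(maxlen=...) drops the oldest ACF automatically when a new one arrives
        self.buffer = deque(maxlen=frames_kept)

    def push(self, frame, max_lag=8):
        """Store the ACF of a new frame and return the average over the buffer."""
        self.buffer.append(autocorrelation(frame, max_lag))
        return self.averaged()

    def averaged(self):
        count = len(self.buffer)
        return [sum(acf[k] for acf in self.buffer) / count
                for k in range(len(self.buffer[0]))]

history = AcfHistory()
for level in (1.0, 2.0, 3.0, 4.0, 5.0):          # five constant 160-sample frames
    averaged = history.push([level] * 160)
```

After the fifth frame the buffer holds only the last four ACFs, so the averaged zero-lag value is 160·(4 + 9 + 16 + 25)/4 = 2160.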

The input digitized speech is input to a speech buffer 64 which, in a preferred embodiment, stores 160 
samples of speech. The speech samples 65 from the speech buffer 64 are supplied to the frame parameters 
function 66 and to the residual and pitch detector function 68. The frame parameters function 66 uses the
VSELP reflection coefficients to determine current frame LPC coefficients 67, which are supplied to the pitch
detector function 68, and the pitch detector function 68 outputs a Boolean variable 69 which is true when pitch
is detected over a speech frame. Existence of a periodic signal is determined in pitch detector function 68. The
frame parameters function 66 also provides an output 70 which is the current and last three frames of the
auto-correlation functions (ACF) and an output 71 which is five sets of LPC coefficients based on the average ACF functions.
The output 71 is supplied to the mean residual power function 72 which, in turn, generates an output 73 rep- 
resenting the current residual power. This output 73 is input to the noise classification function 74, as is the 
Boolean variable 69. The noise classification function 74 generates as its output the noise LPC coefficients 
75 which, together with the output 70 from the frame parameters function 66, is input to the adaptive filtering 
and energy computation function 76, the output of which is the current residual power 77. The VAD decision 
function 78 generates the speech/no speech decision output 79. 
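
One simple way to test for the existence of a periodic signal, as the pitch detector function 68 is said to do, is to look for a strong autocorrelation peak at a plausible pitch lag. The sketch below is an assumption-laden illustration and not the detector of Figure 4: the function name `has_pitch`, the lag range of 20 to 146 samples, and the 0.5 threshold are hypothetical, and the test runs on the raw frame rather than on an LPC residual.

```python
import math
import random

def has_pitch(frame, min_lag=20, max_lag=146, threshold=0.5):
    """True when some lag in [min_lag, max_lag] has r(lag)/r(0) above threshold."""
    energy = sum(x * x for x in frame)            # r(0)
    if energy == 0.0:
        return False                              # an all-zero frame has no periodicity
    peak = max(
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(min_lag, max_lag + 1)
    )
    return peak / energy > threshold

FRAME = 160
# Periodic test signal: a sinusoid that repeats every 40 samples.
voiced_like = [math.sin(2.0 * math.pi * n / 40.0) for n in range(FRAME)]
# Non-periodic test signal: seeded uniform noise.
rng = random.Random(7)
noise_like = [rng.uniform(-1.0, 1.0) for _ in range(FRAME)]

pitch_in_tone = has_pitch(voiced_like)
pitch_in_noise = has_pitch(noise_like)
```

For the sinusoid, the ratio at a lag of 40 samples is 60/80 = 0.75, which exceeds the threshold; white noise has no comparable peak.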

Thus, it will be appreciated that the VAD 60 is basically an energy detector. The energy of the filtered signal
is compared with a threshold, and speech is detected whenever the threshold is exceeded. A FALSE output of
the VAD 60 causes the input to the VSELP coder 62 to be from the high pass filter 58, thereby removing the 
low frequency (i.e., pitch) components of the input signal and thus removing the swirl artifacts that would other- 
wise be generated by the VSELP coder 62 during silence periods. 
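
The arrangement of Figure 3 can be sketched as follows, with the VAD decision supplied as a Boolean. This is a minimal sketch under stated assumptions: the text does not give the order or cutoff of high pass filter 58, so a first-order filter is used and its 500 Hz cutoff is borrowed from the figure the text gives for the VAD's own input filtering; the function names are hypothetical.

```python
import math

def high_pass(frame, cutoff_hz=500.0, sample_rate_hz=8000.0):
    """First-order high-pass filter: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    a = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in frame:
        prev_y = a * (prev_y + x - prev_x)
        prev_x = x
        out.append(prev_y)
    return out

def select_coder_input(frame, speech_detected):
    """Switch 56: pass the frame unchanged for speech, high-pass filtered otherwise."""
    return list(frame) if speech_detected else high_pass(frame)

def energy(frame):
    """Sum of squared samples."""
    return sum(x * x for x in frame)

# A 100 Hz tone stands in for low frequency content in the pitch range.
low_tone = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0) for n in range(160)]
during_speech = select_coder_input(low_tone, speech_detected=True)
during_silence = select_coder_input(low_tone, speech_detected=False)
```

With the VAD output TRUE the coder sees the frame unchanged; with it FALSE the low frequency tone reaches the coder strongly attenuated, so there is little pitch-range energy left for the coder to lock onto.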

While the invention has been described in terms of a single preferred embodiment, those skilled in the art 
will recognize that the invention can be practiced with modification within the spirit and scope of the appended 
claims. 
Claims

1. A system for the removal of swirl artifacts from a code excited linear prediction (CELP) based encoder
(62) comprising:

    a switch (56) connected to receive an input signal, said input signal containing periodic and
non-periodic signals;

    a high pass filter (58) also connected to receive said input signal and operable to remove low
frequency components from said input signal, said switch being controllable to selectively supply said input
signal or an output of said high pass filter to the CELP based encoder; and

    a detector (60) connected to receive said input signal and information from said CELP based
encoder and generate an output indicating the presence of periodic signals in said input signal, said detector
controlling said switch to connect said input signal to said CELP based encoder when periodic signals
are detected and to connect the output of said high pass filter to said CELP based encoder when
non-periodic signals are detected.

2. The system recited in claim 1 wherein said CELP based encoder (62) is a vector-sum excited linear pre- 
dictive (VSELP) speech encoder (62). 

3. The system recited in claim 1 or 2 wherein said detector receives reflection coefficients (66) from said 
CELP based encoder and determines an energy level (76) of said input signal in order to make a deter- 
mination of the presence of periodic signals in said input signal. 

4. The system of claim 1, 2, or 3 wherein said periodic signals are speech-like and said non-periodic signals
are noise-like and wherein said detector (60) is a voice activity detector (VAD).

5. The system of claim 1, 2, 3, or 4 wherein said low frequency components removed by said high pass filter
correspond to pitch information.

6. The system of claim 1, 2, 3, 4, or 5 further comprising a control gate connected to the detector and the
CELP based encoder for instructing the CELP based encoder to encode filtered input signals without pitch
information when non-periodic signals are detected and to encode input signals with pitch information
when periodic signals are detected.

7. A method for the removal of swirl artifacts from a code excited linear prediction (CELP) based speech
encoder (62) comprising the steps of:

    sampling an input signal and converting input signal samples to digital values (54), said input signal
containing periodic and non-periodic signals, said periodic signals being speech-like signals and said
non-periodic signals being noise-like signals;

    high pass filtering (58) said digital values of the input signal to remove low frequency components
from samples of the input signal, said low frequency components corresponding to pitch information;

    determining the presence of speech-like signals in said input signal using a voice activated detector
(VAD) (60) connected to receive said digital values of the input signal and information from said CELP
based speech encoder; and

    selectively supplying (56) said digital values of the input signal or high pass filtered digital values
to the CELP based speech encoder, said digital values of the input signal being connected to said CELP
based speech encoder when speech-like signals are detected and the high pass filtered digital values
being connected to said CELP based speech encoder when noise-like signals are detected.

8. The method of claim 7 further comprising:

    selectively causing said CELP based speech encoder to declare a no pitch condition when
noise-like signals are detected by said VAD, said CELP based speech encoder continuing to process digital
values of the input signal without pitch information, but when speech-like signals are detected by said VAD,
said CELP based speech encoder resuming processing of digital values of the input signal with pitch
information.